RO CENTER FOR GENETIC ENGINEERING



GENOME SEQUENCING PROCEDURE BY HYBRIDIZATION
WITH OLIGONUCLEOTIDE PROBES

a) Technical Field 

The present invention is in the field of molecular biology. In the international patent classification, it belongs
in class 

b) Technical Problem

The size of genomes ranges from about 4x10^6 base pairs (bp) in E. coli to 3x10^9 bp in mammals. Determining
the primary structure or the sequence of the entire genome, particularly the human genome, is our challenge at the
end of the 20th century. An even greater challenge for biology is the determination of the entire genomic sequence
for characteristic species of the living world. This would provide a qualitative jump in the interpretation of the
functioning and evolution of organisms. It would also represent a major jump in the explanation and curing of many
diseases, in food production and in biotechnology in general.

c) State of the Art

The technology of recombinant DNA has made it possible to replicate and isolate short fragments of genomic
DNA (from 200 to 50,000 bp). In this manner, a sufficient amount of material was obtained for determining the
sequence in which the nucleotides in the cloned fragment are arranged. The sequence is determined on
polyacrylamide gels capable of separating DNA fragments of 1 to a maximum of 500 bp and differing by the length
of one nucleotide. The four nucleotides are differentiated in two ways: by specific chemical degradation of the
DNA strand at sites where the particular nucleotide is located, by the Maxam-Gilbert method [Maxam, A.M., and
Gilbert, W., Proc. Natl. Acad. Sci. 74, 560 (1977)], and by using enzymatic DNA synthesis on the cloned matrix
which involves the addition of a dideoxynucleotide capable of stopping the synthesis at all sites at which this
nucleotide is located in the cloned fragment, by the method of Sanger [Sanger, F., et al., Proc. Natl. Acad. Sci.
74, 5463 (1977)]. Both methods require a considerable amount of manual work so that the rate of sequencing in
good laboratories throughout the world is about 100 bp per day per person. By use of electronics (computers and
robots), sequencing can be accelerated by a few orders of magnitude. The idea of sequencing the entire human
genome has been discussed at many scientific meetings in the United States [Science 232 (Research News),
1598-1599 (1986)]. The general conclusion is that sequencing can be accomplished only in well organized centers
(sequencing factories), that the cost would be about 3 billion dollars and that the task would take at least 10 years.
Japanese experts are currently ahead of all others in organizing components of such a center. Their sequencing
center has a capacity of about one million bp per day, the cost being 0.17 dollar per genomic bp [Nature 325
(Commentary), 771-772 (1987)]. Because the random selection of cloned fragments containing about 500 bp
requires sequencing three genome lengths, the sequencing of 10 billion bp in such a center would take 30 years;
namely, to sequence the human genome alone in a few years, at least 10 such centers would be needed.

d) Description

Our sequencing procedure has an entirely different logic and is applicable only to the determination of
sequences of the entire genome(s); it is uneconomical for the determination of specific short fragments. The
procedure is based on strictly specific hybridization of oligonucleotide probes (ONPs) that are 10 to 40 nucleotides
long. Because hybridization conditions can be determined when ONPs hybridize only to sequences with complete
homology, the sequence can be read by such hybridization. By hybridizing the entire genomic DNA replicated in
fragments of appropriate length with a sufficient number of ONPs and by computerized arrangement of the detected
sequences, the entire genome can be sequenced at the same time. We believe that this procedure is several times
faster and less expensive than the procedure now being developed and that for this reason it could be applicable to
the sequencing of genomes of all characteristic species.

For this procedure, it is necessary to optimize the length, sequence and number of the ONPs, the length of
the genomic DNA fragments that represent a hybridization point, and the method of separate replication of each such
fragment.

The number of possible arrangements of the four nucleotides as a function of length n is equal to 4^n and for
some lengths is shown in the following table:

Length (bp)        9         10         11          12          13

Number        262144    1048576    4194304    16777216    67108864
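
The entries of this table can be checked with a few lines of code. The following is only an illustrative sketch; the function name probe_count is an assumption introduced here and does not appear in the application.

```python
def probe_count(length: int) -> int:
    """Number of distinct sequences of the given length over A, C, G and T."""
    return 4 ** length


# Reproduce the table for lengths 9 to 13.
for n in range(9, 14):
    print(n, probe_count(n))
```
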

On the basis of the foregoing, to detect every possible sequence, namely to accomplish the displacement of
only one bp in the ONP arrangement, it is necessary to use about 260,000 9-mers to 67x10^6 13-mers. Because
specific hybridization is likely to be achieved only with an ONP with 10 nucleotides or more [Wallace, R.B.,
Nucleic Acids Research 6, 3543-3557 (1979)], the number of required ONPs will be smallest if a 10-mer or 11-mer
is used. Because of the existence of two complementary DNA strands, one ONP detects two different sequences
when read in a single sense. For example, the 5'CAGA3' also detects the sequence 5'TCTG3', namely it detects

5'CAGA3'   and   5'TCTG3'
3'GTCT5'         3'AGAC5'

For this reason, only one half as many ONPs, namely a maximum of about 2 million 11-nucleotide ONPs,
are needed. Palindromic ONPs are an exception. Among all 11-mers, 4^6, namely 4096 or only one thousandth,
are palindromic. Conversely, this means that the frequency of nonpalindromic probes in a genome is twice as high.
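
The halving of the probe set by strand complementarity can be illustrated on short sequences. The sketch below is not part of the application: it reads the four-base example as 5'CAGA3' and 5'TCTG3' (an interpretation of the printed duplex), and the helper names reverse_complement and distinct_probes are assumptions. It enumerates all k-mers and merges each one with its reverse complement, which is only practical for small k.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def distinct_probes(k: int) -> int:
    """Count k-mers after merging every k-mer with its reverse complement."""
    seen = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        # Keep one representative per complementary pair.
        seen.add(min(kmer, reverse_complement(kmer)))
    return len(seen)


print(reverse_complement("CAGA"))
print(distinct_probes(4), "of", 4 ** 4)
```

For k = 4 the count is slightly more than one half of 4^4, because a self-complementary sequence has no distinct partner.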

For unequivocal sequence determination, it is not necessary to utilize all ONPs of a given length. The use
of a smaller genomic fragment as a hybridization point makes it possible to use fewer probes. In this case, the
probe overlap will be less but still sufficient so that in the short genomic DNA fragment sequences of overlap
lengths will not be repeated many times.

From the average distance (S) between sites complementary to one ONP, which depends on the ONP length
and the ratio of its dinucleotide composition to the dinucleotide composition of the genomic DNA being sequenced,
it is possible to determine the frequency of the given sequence along a certain length of genomic DNA. We used
equations derived on the basis of the theory of probability [Drmanac, R., et al., Nucleic Acids Research 14,
(1986) and our Patent Application No. 5742 of March 24, 1987]. Table 1 shows the average distance
between sequences of certain homologous ONPs in mammalian genomes.






Table 1

Average Distance (S) Between Sequences of
Homologous ONPs in Mammalian Genomes

                     ONP Length (bp)
C+G        7         8         9        10        11

0       2300      7600     25400     85000    282000
1       3450     11400     38100    127500    423500
2       5170     17100     57150    191250    634000
3       7750     25600     85700    285000    951000
4      11600     38500    128600    330000   1427000
5      17500     57700    190000    495000   2140000
6      26200     86600    285000    742000   3210000
7      39300    129900    330000   1113000   4816000
8               195000    495000   1670000   7724000



From the average distance (S) and using the following equation, we calculated the percentage of genomic
fragments of length D within which the sequence recognized by the given ONP is repeated at least once:

P(D) = [1 - (1 - 1/S)^D] x 100                                                   (1)

The results are presented in Table 2.

Table 2

Percentage of 5000 to 20,000 bp-long Genomic
DNA Fragments Containing Sequences of Complementary ONP
with S Equal to 25,000 to 200,000 bp

                      S

D        25000    50000    100000    200000

5000        18      9.5         5       2.5

10000       32       18       9.5         5

20000       55       33        18       9.5
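
Equation 1 can be evaluated directly. The following sketch is an illustration only; the function name percent_with_site is an assumption. Its output agrees with the tabulated percentages to within rounding.

```python
def percent_with_site(d: int, s: float) -> float:
    """Eq. 1: percentage of fragments of length d holding at least one site
    of a sequence whose average spacing in the genome is s."""
    return (1.0 - (1.0 - 1.0 / s) ** d) * 100.0


for d in (5000, 10000, 20000):
    row = [round(percent_with_site(d, s), 1)
           for s in (25000, 50000, 100000, 200000)]
    print(d, row)
```
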
By using the binomial distribution, we determined the probability that the sequence that is complementary to
the given ONP will be repeated a certain number of times in a DNA fragment of defined length. This probability
depends on the average distance (S) separating the given sequence in the genome. This probability can be calculated
by the following equation:

P(N) = C(D,N) x (1/S)^N x (1 - 1/S)^(D-N)                                        (2)

wherein D is the length of the DNA fragment in bp and N is the number of repetitions of the given sequence within
the length D whose probability P(N) is being sought. C(D,N) is the number of combinations without repetition of
class N of D elements. Because within length D there are approximately D sequences which on average have the
same S, by multiplying the above probability by D we obtain the number of different sequences of a defined S that
are repeated N times within length D. The calculated numbers of different sequences within a given range of S,
D and N are presented in Table 3.



Table 3

Number of Oligonucleotide Sequences with S in
the Range of 25,000 to 200,000 bp and Repeated
2 and 3 Times in 5000 bp and 10,000 bp Fragments

                           S

D        N     25000    50000    100000    200000

5000     2        82       22       4.5       1.6

         3         5     0.75      0.07     0.013

10000    2       558      163        45        11

         3        74       10       1.5      0.18
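
The calculation behind Table 3 can be sketched as Eq. 2 multiplied by D. This is an illustration under the definitions given above, with the function names being assumptions; the computed values agree with most of the tabulated entries to within rounding but not necessarily with every one.

```python
from math import comb


def prob_repeats(d: int, n: int, s: float) -> float:
    """Eq. 2: probability that a sequence with average spacing s occurs
    exactly n times within a fragment of length d."""
    p = 1.0 / s
    return comb(d, n) * p ** n * (1.0 - p) ** (d - n)


def expected_repeated_sequences(d: int, n: int, s: float) -> float:
    """Number of different sequences of spacing s repeated n times within d,
    taking about d sequences per fragment as stated in the text."""
    return d * prob_repeats(d, n, s)


for d in (5000, 10000):
    for n in (2, 3):
        row = [round(expected_repeated_sequences(d, n, s), 3)
               for s in (25000, 50000, 100000, 200000)]
        print(d, n, row)
```
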



From Table 3 it is possible to estimate the minimum required number of ONPs of a defined length and C+G
composition necessary for successful reading of sequences in the 5 kbp or 10 kbp long fragment. The frequency
of two kinds of sequences is essential for such reading, namely the frequency of sequences of complementary ONPs
and that of the sequences of ONPs with average maximum ONP overlap. This will be explained on the example
of ONPs with 11 nucleotides.



For about 2x10^6 ONPs of 11 bp length, every 11-mer is detected in any DNA sequence. In this case, overlap
is always at a maximum and amounts to 10 bp. It can be seen from Table 1 that the average distance between the
most frequent 11-mers is 282,000. If all 4989 11-mers in the 5000 bp fragment were equally frequent, only one
of them would be likely to be repeated twice (Table 3). Because there are probably no sequences of 5000 bp length
with 90% or more of A + T, this means that a more significant repetition of 11-mers in the 5000 bp fragment and
probably also in the 10,000 bp fragment can occur only for nonrandom reasons. With regard to the overlap
sequence (10-mers), the highest frequency occurs for an average of 85,000 (Table 1), and 2-fold repetition within
5000 bp would occur for a maximum of five such sequences, and probably for an average of about two sequences.

The ease and accuracy, namely the nonambiguity, of the reading of sequences depends on the number of
repetitions of overlapping sequences. If we imagine reading as a two-directional progression of one or more starts
(randomly selected ONPs from among all ONPs capable of hybridization to the given genomic DNA fragment), then
for each starting 11-mer we look for the left and the right base pair by searching among the hybridized ONPs for
the 10-mer that is to the left and the right of the starting 11-mer. When after a certain number of reading steps the
10-mer is found which because of being repeated in the given sequence is present in more than one 11-mer, the
reading in this sense must be interrupted here, because we do not know which of the detected base pairs are in the
continuation of the sequence and which are at some other location. By reading in the other sense, this interruption
will be overcome. Considerable repetition of overlapping sequences, however, will make reading more difficult,
and it may even become impossible to overcome the interruption.
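
The reading step described above can be sketched in code. This is a minimal illustration, not the procedure of the application: it uses 4-mers instead of 11-mers, ignores the complementary strand, assumes an error-free set of hybridized probes, and all function names are assumptions. Extension proceeds one base at a time and stops when the overlap fits no probe or more than one probe.

```python
def kmers(seq: str, k: int) -> set:
    """All k-long subsequences of seq, standing in for the hybridized ONPs."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def extend(seq: str, probes: set, k: int, to_right: bool) -> str:
    """Extend seq base by base while exactly one probe fits the overlap."""
    used = set()
    while True:
        if to_right:
            overlap = seq[-(k - 1):]
            fits = [p for p in probes if p[:-1] == overlap]
        else:
            overlap = seq[:k - 1]
            fits = [p for p in probes if p[1:] == overlap]
        # Interrupt at a dead end, at a repeated overlap, or on a loop.
        if len(fits) != 1 or fits[0] in used:
            return seq
        used.add(fits[0])
        seq = seq + fits[0][-1] if to_right else fits[0][0] + seq


def read_from(start: str, probes: set) -> str:
    """Read in both senses from a starting probe."""
    k = len(start)
    return extend(extend(start, probes, k, True), probes, k, False)


unique = kmers("ATGCGTACCTGA", 4)
print(read_from("CGTA", unique))

# Here the overlap GTT occurs twice, so reading to the right is interrupted.
repeated = kmers("ACGTTGCGTTA", 4)
print(extend("ACGT", repeated, 4, True))
```
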

On the basis of the calculated repeatability, it is possible to estimate the lowest number of 11-nucleotide ONPs
required to prevent the interruption of sequence reading or the ambiguous linking of the read fragments. By
reducing the number of ONPs, the overlapping sequence is shortened and its repeatability is thus increased. By
synthesizing a larger number of more frequent 11-mers (containing more A and T) and a lesser number of those
with more C and G, it is possible to achieve the same optimal repeatability of overlapping sequences although of
different lengths. Assuming that the maximum repeatability of overlapping sequences resulting in successful reading
is about 20 sequences repeated twice, for the sequencing of 5000 bp fragments, the average distance between
overlapping sequences must not be less than 50,000 bp. This means that the following needs to be synthesized: all
ONPs with one or without any C or G (this gives an overlap length of 10 bp); every other 11-mer with C + G from
2 to 4 (this gives an overlap length of 9 bp); every third 11-mer with C + G from 5 to 7 (this gives an overlap
length of 8 bp), and every fourth 11-mer with C + G greater than 7 (this gives an overlap length of 7 bp). The
total number of ONPs thus selected would be about 10^6. In our opinion, computer simulation would show that
even one half of this number of 11-nucleotide ONPs would be sufficient. The sequencing of 10,000-bp fragments
would require about 10^6 11-mers. If 12-mers were used, this number would be at least three times higher.
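
The size of such a selected probe set can be estimated by counting 11-mers in each C + G class. The sketch below is a rough illustration that rests on two assumptions not stated in the application: that "every n-th" means keeping a fraction 1/n of a class, and that the count is taken before the halving for complementary strands. Under these assumptions the total is about 1.6 million, or about 0.8 million after halving, which is of the order of 10^6.

```python
from math import comb


def kmers_with_gc(k: int, gc: int) -> int:
    """Number of k-mers containing exactly gc bases that are C or G."""
    # Choose the C/G positions, then one of two bases at every position.
    return comb(k, gc) * 2 ** k


def selected_11mers() -> int:
    """11-mers kept under the selection rule of the text (assumed 1/n)."""
    total = 0
    for gc in range(12):
        if gc <= 1:
            keep_every = 1      # all ONPs with at most one C or G
        elif gc <= 4:
            keep_every = 2      # every other 11-mer
        elif gc <= 7:
            keep_every = 3      # every third 11-mer
        else:
            keep_every = 4      # every fourth 11-mer
        total += kmers_with_gc(11, gc) // keep_every
    return total


print(selected_11mers(), selected_11mers() // 2)
```
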

For easier reading, synthetic ONPs can be arranged by starting from one or several ONPs and proceeding
over the overlapped parts. The ONPs thus arranged would be marked by letters in alphabetical order and according
to increasing numbers. Such marking would make it possible to arrange the ONPs that hybridize to a given genomic
DNA fragment into one or several arrays which would then be converted to the DNA sequence only by deciphering.

The replication of genomic DNA in fragments of defined length can be accomplished in two ways: 1) by
cloning, and 2) by amplification.

It can be seen from the foregoing analysis that the maximum length that can be read with a reasonable number
of ONPs is about 10,000 bp and that 4000-5000 bp is a better length. Plasmid vectors are most advantageous for
cloning these lengths. To create a complete genomic library, these vectors, because of their lower transformation
efficacy compared to phage vectors, require 20 to 100 µg of genomic DNA, which is not a major requirement for
the one-time creation of the library. For a better representation of genomic DNA, it would be necessary to generate
5000 bp long fragments by partial digestion with two to three common enzymes (Sau3A, DdeI, AluI). To reduce
the effect of any 'toxic' and repetitive sequences (in this respect, plasmid vectors have an advantage over phage
or cosmid vectors), it is necessary to form a library in two vectors. In our opinion, plasmids of series pUC and
pAT are most advantageous for this purpose because they multiply well and are relatively small.

The sequencing of cloned fragments by hybridization can be accomplished in two ways: by colony
hybridization and by dot blot hybridization of isolated plasmid DNA. In both cases, 2000 to 3000 different ONPs
represented in the vector sequence cannot be utilized, i.e., they will not even be synthesized.

Colony hybridization is probably faster and less expensive than dot blot hybridization, but it requires specific
conditions to eliminate the effect of hybridization with bacterial DNA. To reduce general background noise, the
labeling of probes should confer high sensitivity in hybridization, because in this manner very small colonies could
be used. ONP labeling should in any case be by biotinylation because of easy and lasting labeling in the last
synthesis step. The sensitivity achieved in this case [Al-Hakim, A.H., and Hull, R., Nucleic Acids Research 14,
9965-9976 (1986)] makes it possible to utilize at least 10 times fewer colonies than are required by the standard
method.

To avoid false positive hybridizations caused by homology of the ONP with the bacterial sequence and to
utilize short probes such as the 11-mers, which on average are repeated twice in the bacterial chromosome, it is
necessary to use vectors giving a maximum number of copies per cell. It is known that by additional amplification
on chloramphenicol, pBR322 can produce 300 to 400 copies per bacterial cell [Lin-Chao, S., and Bremer, L., Mol.
Gen. Genet. 203, 150-153 (1986)]. The replication efficacy of the plasmids pAT and pUC is at least twice as high
[Twigg, A.J., et al., Nature 283, 216-218 (1980)], so that we can assume that under optimum conditions even 500
plasmid copies can be produced per cell. Because of the load represented by the sequence introduced, the chimeric
plasmids will certainly not multiply as well, particularly in the presence of more toxic sequences. For this reason
it is necessary to work with about 200 copies of chimeric plasmid per cell. This means that, on average, with each
11-mer the signal would be 100 times stronger if the complementary sequence were located on the plasmid. This
represents a sufficient difference so that with a small amount of DNA, namely by use of small hybridization
colonies, hybridization with the bacterial DNA would not register.

By using the binomial distribution, we determined how many ONPs will be repeated in the bacterial
chromosome more than 10 times as a result of random distribution. Such ONPs would give unreliable information
or, if they gave approximately the same signal strength with all colonies, they could not be used at all.

Table 4 shows the results obtained by use of Eq. 2, wherein D is the length of the bacterial chromosome, i.e.,
4x10^6 bp, and S is the number of different ONPs. This calculation assumes that all nucleotides and dinucleotides
are uniformly represented in the DNA of E. coli, which is almost entirely the case.

Table 4

Probability of a Given 11-mer Frequency
in the E. coli Genome

No. of repetitions (n)       0      2      4      6       8      10         14

Percent 11-mers           13.5     27      9    1.2   0.086   0.004    7x10^-6

Total no. of 11-mers                                   1720      80       0.14
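
The percentages for the E. coli genome can be approximated with a short calculation. The sketch below is illustrative only and makes two assumptions: the number of different ONPs is taken as S = 2x10^6 (about one half of all 11-mers), so that the mean number of occurrences is D/S = 2, and the Poisson distribution is substituted for the binomial of Eq. 2, which is a close approximation when D is large and 1/S is small. The function name is an assumption.

```python
from math import exp, factorial


def poisson_percent(n: int, mean: float) -> float:
    """Percentage of probes expected to occur exactly n times."""
    return exp(-mean) * mean ** n / factorial(n) * 100.0


D = 4_000_000          # length of the bacterial chromosome in bp
S = 2_000_000          # assumed number of different 11-mer ONPs
mean = D / S           # about two occurrences per 11-mer

for n in (0, 2, 4, 6, 8, 10):
    print(n, round(poisson_percent(n, mean), 4))
```
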



It can be seen from Table 4 that it cannot be expected that any 11-mer will be repeated more than 13 times
and that 300 is the total number of those that are repeated more than 10 times. This means that the vast majority
of 11-mers will have a more than 20 times stronger signal originating from the cloned DNA than from the bacterial
DNA. The naturally determined number of 11-mers will for functional reasons be large in bacterial DNA, but
because as a result of recombination, bacteria do not tolerate significant repetition, we can expect that the number
of such 11-mers will be small. They simply would not be utilized for hybridization.

The problem of hybridization with bacterial DNA can also be solved by selective prehybridization using 'cold'
bacterial DNA. By preparing this DNA in fragments larger than 100 bp and smaller than about 10,000 bp under
stringent hybridization conditions in which only fragments with homology greater than 50 bp undergo hybridization,
bacterial DNA would be preferentially 'covered'. This is because the probability that there are random homologous
sequences of 50 bp or longer between bacterial and eukaryotic DNA is negligible.

Selective prehybridization also makes it possible to use several probes simultaneously in the colony
hybridization sequencing procedure. In this manner, the required number of independent hybridizations can be
reduced. On the other hand, to determine which ONP or ONPs enable the combination to undergo positive
hybridization, each probe must be present in several combinations, and this increases the required amount of each
ONP. However, because for successful and fast hybridization it is necessary to achieve a certain concentration of
probes in the hybridization liquid, and because probe consumption is very low so that the concentration is only very
slightly reduced after hybridization, a larger number of filters can be hybridized in a few portions of smaller volume
of the same hybridization liquid, which requires a smaller amount of probe or probes.

By using 30 ONPs per hybridization and by repeating one ONP in three combinations so that none of the other
90 probes is present in two of the three combinations, the number of hybridizations is reduced tenfold at the cost
of a three times larger amount of each ONP required. Based on the probability that the combination of a defined
number of ONPs hybridizes to the fragment of genomic DNA of defined length, we determined the percentage of
information that is lost compared to when each ONP is used separately.

The average distance between homologous sequences for 30 ONPs with 11 nucleotides is about 130,000 bp.
For the sequencing of mammalian genomes, because of the more accurate reading, proportionally more ONPs with
a more frequent homologous sequence, namely containing more A and T bases, would be synthesized. Hence, in
this case, the average distance (S) would be about 100,000 bp. By use of the equation P(D) = 1 - (1 - 1/S)^D (Eq.
1), we determined the probability that a combination of 30 ONPs will hybridize to a genomic DNA fragment of
length D = 5000 bp. This probability is 0.0485. The probability that three different combinations will hybridize
to the same fragment is 1.25x10^-4. Since 2 million colonies (fragments) are being hybridized, in [illegible]
all three combinations that have a common ONP will hybridize to at least one of their probes. For these colonies,
we will not know whether they have a sequence complementary to the common ONP. Because for mammalian
genomes the number of clones that contain at least one complementary ONP sequence that is common to the three
combinations is 300 to 30,000, the number of colonies that will also simultaneously hybridize with the common
probe will in the worst case be less than four. For one million different ONPs this means a maximum of 4 million
lost pieces of information assuming that the common ONP does not hybridize wherever there is ambiguity as to
whether it hybridizes or not. This represents a loss of only one millionth part of the information that would be
obtained by hybridization with each ONP separately.
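
The probabilities in this paragraph follow from Eq. 1 without the percentage factor. The sketch below is an illustration only, with assumed names; it gives values close to, though not identical with, the rounded figures quoted in the text.

```python
def hit_probability(d: int, s: float) -> float:
    """Eq. 1 as a probability: at least one complementary site within d."""
    return 1.0 - (1.0 - 1.0 / s) ** d


p_pool = hit_probability(5000, 100_000)   # one combination of 30 ONPs
p_three = p_pool ** 3                     # three combinations, same fragment
colonies = 2_000_000

print(round(p_pool, 4))
print(p_three)
print(round(colonies * p_three))          # colonies hit by all three
```
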

Information is more likely to be lost by rejection of positive hybridization with the ONP that is common to
the three combinations that hybridize to the given genomic fragment as a result of erroneous determination that each
combination contains at least one ONP that hybridizes with the given fragment. This error arises in the
determination of positive hybridization for each ONP from the three combinations involved when one considers
whether the other two combinations that contain them also hybridize. If the other two combinations hybridize, then
a high probability exists that the positive hybridization is due to the common ONP. On the other hand, if the
combinations are large, the probability is higher that two different probes undergo hybridization, each in one
combination. This would mean that the common ONP probably does not hybridize and, hence, we would not have
to reject the initially considered ONP as the one that does not hybridize to the given fragment. This probability
(Pgi) can be calculated approximately by use of the equation Pgi = {[P(D)]^2 x K}^3, wherein K is the number of ONPs
in the combination and P(D) is the probability that at least one of K ONPs hybridizes to one fragment of genomic
DNA having length D (Eq. 1). The formula is valid for [P(D)]^2 x K < 1. When fragments of length D = 5000
bp are sequenced, then with combinations having K = 30 ONPs, 0.1% of the information is lost; with K = 40 ONPs,
0.5% is lost; with K = 45 ONPs, 1.32% is lost; with K = 50 ONPs, 3.3% is lost; and with K = 60 ONPs, 16% of
the information is lost. It can be concluded that a 10-15-fold reduction in the required number of hybridizations
can be achieved with a small loss of information. This time the number of required filters, namely the number of
replications to be made of 2 million clones, would also be smaller.
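
The dependence of the information loss on the size of the combination can be sketched numerically. The printed equation is badly damaged in the source, so the following rests on assumptions: the formula is read as Pgi = {[P(D)]^2 x K}^3, and the average distance for a combination of K ONPs is taken as 3x10^6/K bp, which gives the 100,000 bp quoted for K = 30. With this reading the calculation reproduces the losses quoted for K = 40 to 60 but gives a smaller value than the one quoted for K = 30. The function names are hypothetical.

```python
def pool_hit_probability(k: int, d: int = 5000,
                         s_single: float = 3_000_000) -> float:
    """Eq. 1 for a combination of k ONPs (assumed spacing s_single / k)."""
    s_pool = s_single / k
    return 1.0 - (1.0 - 1.0 / s_pool) ** d


def info_loss_percent(k: int, d: int = 5000) -> float:
    """Assumed reading of the formula: Pgi = ([P(D)]^2 * K)^3, in percent."""
    x = pool_hit_probability(k, d) ** 2 * k
    if x >= 1.0:
        raise ValueError("formula is valid only for [P(D)]^2 x K < 1")
    return x ** 3 * 100.0


for k in (30, 40, 45, 50, 60):
    print(k, round(info_loss_percent(k), 2))
```
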

The total number of hybridization points can be reduced by using a few hybridization steps with large
combinations. Thus, at the expense of 2-3000 additional hybridizations and 2-3 rearrangements of hybridization
points, each point could be searched with 3-1 times fewer hybridizations, namely the transfers to the filter could
thus be reduced this many times.

By hybridization of the isolated plasmid DNA, the hybridization procedure would be facilitated, but it would
be necessary to isolate a sufficient quantity of plasmid DNA from many clones. The number of clones with 5000-bp
fragments for threefold covering of mammalian genomes is 2x10^6. The required amount of DNA from each clone
(Mp) is given by the product

Mp = (Dp/Do) x Bh x (1/Br) x Md

where Dp is the size of the chimeric plasmid in bp, Do is the ONP length, Bh is the number of required
hybridizations, Br is the number of rehybridizations of the same filter and Md is the amount of DNA that can be
detected by the hybridization procedure. By taking the most probable values, namely Dp = 8000, Do = 11, Bh
= 2x10^4, Br = 10 and Md = 0.1 pg, we find that it is necessary to isolate about 0.2 µg of DNA for each chimeric
plasmid. Successful rehybridization of filters that have been hybridized with a biotinylated probe has not been
developed to date. On the other hand, there are indications that with biotinylated probes it is possible to detect as
little as 0.001 pg. Hence, from each of the 2x10^6 clones it is necessary to isolate about 0.1 to 1 µg of plasmid
DNA.
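
The estimate of the DNA required per clone can be retraced as follows. This is a sketch under assumptions: the product is read as Mp = (Dp/Do) x Bh x (1/Br) x Md, and the number of hybridizations is taken as 2x10^4, a value chosen because it brings the result close to the quoted 0.2 µg; the exponent is not legible in the source. The function name is hypothetical.

```python
def dna_per_clone_pg(plasmid_bp: float, probe_len: float,
                     hybridizations: float, rehybridizations: float,
                     detectable_pg: float) -> float:
    """Mp in picograms: plasmid DNA needed from one clone."""
    return (plasmid_bp / probe_len) * hybridizations / rehybridizations * detectable_pg


# Assumed values: Dp = 8000, Do = 11, Bh = 2x10^4, Br = 10, Md = 0.1 pg.
mp_pg = dna_per_clone_pg(8000, 11, 2e4, 10, 0.1)
mp_ug = mp_pg / 1e6    # picograms to micrograms
print(round(mp_ug, 3))
```
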

The amplification of the entire genomic DNA can be accomplished in about one million portions of a size up
to 10,000 bp whereby the genome would be covered more than three times. This is accomplished by means of an
appropriately chosen mixture of oligonucleotides as primers (our Patent Application No. 5742 of March 24, 1987).
With about 50,000 different oligonucleotides having the complementary sequence repeated 800 times in the
nonrepetitive part of the mammalian genome (for example, a 12-mer with C + G from 1 to 5), it is possible to
carry out one million amplification reactions with combinations containing 50 primers so that each primer enters
only once into the same combination with every other primer. With such primer combinations, there will be an
average of 60 sites in the genome where two primers will be oriented so that their 3' ends will face each other and
will be separated by a distance of less than 300 bp. The fragments between these primers will be amplified.
Because their average length is 150 bp, the total length of the amplified genome is about 9000 bp. One million of
such amplification reactions replaces the plasmid and phage library of the mammalian genome. In the amplification
it is not possible to utilize primers that enter into highly repetitive sequences (those that are repeated more than
2-3000 times); hence only the amplification or sequencing of the nonrepetitive part of the genome takes place. In
addition, with 50,000 primers with a frequency of 800 in the nonrepetitive part of the genome, about 10% of this
part of the genome would not enter into the amplification units. With 100,000 primers, only 0.4% of the
nonrepetitive part of the genome would remain unamplified. With 100,000 primers, it is necessary to carry out 4
million amplification reactions.
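
The number of amplification reactions follows from counting primer pairs. The sketch below is an illustration with an assumed function name: if every pair of primers shares a combination exactly once and each combination holds 50 primers, the number of reactions is the number of pairs divided by the pairs per combination. It assumes that such an arrangement of combinations can actually be constructed, which is not shown here.

```python
from math import comb


def reactions_needed(primers: int, per_reaction: int = 50) -> float:
    """Reactions needed so that each primer pair shares one combination."""
    return comb(primers, 2) / comb(per_reaction, 2)


print(round(reactions_needed(50_000)))     # about one million
print(round(reactions_needed(100_000)))    # about four million

# Amplified length per reaction: 60 sites with an average of 150 bp each.
print(60 * 150)
```
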

By dot blot hybridization of amplifying reactions with oligonucleotides that served as primers and with newly
synthesized ONPs up to the required number of about one million, only the sequences of the amplified fragments
would be read, because with a 1x10^[illegible]-fold amplification each ONP having a complementary sequence in the
amplified fragment would have a 3-1000 times larger number of targets than if it hybridized only to the homologous
sequences in the unamplified part of the genome. Only a three times stronger signal is expected for 11-nucleotide
ONPs that do not contain C or G, and a 1000 times stronger signal for 12-mers without A or T. It can be seen from
this analysis that by sequencing regions rich in A and T it is possible to utilize ONPs longer than 11 bp (the 12-mer
would give a signal 10 times stronger than the background noise). In this case, it is impossible to utilize ONP
combinations for hybridization, because the signal would be equal to the background noise, and no possibility exists
for selective prehybridization.

The advantage of amplification over cloning is that no living material is used. This procedure is much more
expensive, however, because each primer is consumed in 10 times larger quantity than if it were used only as a
probe. Moreover, about 10^[illegible] to 10^[illegible] enzyme units of the Klenow fragment of polymerase I are required.

Because each genomic DNA fragment hybridizes with all probes, it is necessary, if there is no rehybridization
and if probe combinations are not used for hybridization, to apply each colony or each isolated DNA or
amplification reaction to about one million filters for about one million probes. This would be done by simultaneous
automatic application of a large number of samples (about 100). With DNA, this is much easier than by making
colony replicas. Most likely, the colonies would be automatically seeded into about one million replicas of each
of the approximately 2 million clones by taking a minimum amount of bacteria from clones grown on microtitration
plates. To avoid removing colonies from Petri dishes, the transformation mixture can be diluted by seeding the
specified volume into a hole of the microplate so that one or no transformed cell is seeded. To eliminate empty
holes, a transplantation would then be performed from holes with viable growth to a new plate. The most difficult
condition is the need to achieve approximately the same growth of all colonies on all filters.

If there is no rehybridization and if probe combinations are not used, the total number of hybridization points
equals the product of the number of genomic DNA fragments (colonies, clones, amplification reactions) by the
number of ONPs. For mammalian genomes this amounts to about 10^12 points. If each point requires about 3 mm^2,
about 3x10^6 m^2 of filters will be needed. With 10 rehybridizations and a 20-fold reduction in the number of
hybridizations per fragment, about 15,000 m^2 of filters is required.
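
The filter area follows from simple arithmetic, sketched below for illustration. The figures of about 10^12 points and 3 mm^2 per point are a reading of damaged numerals in the source, chosen because they lead to the quoted 15,000 m^2 after the two reductions; the function name is an assumption.

```python
def filter_area_m2(points: float, mm2_per_point: float = 3.0) -> float:
    """Total filter area in square metres for the given number of points."""
    return points * mm2_per_point / 1e6    # 1 m^2 = 10^6 mm^2


full_area = filter_area_m2(1e12)
# 10 rehybridizations per filter and 20-fold fewer hybridizations.
reduced_area = full_area / (10 * 20)

print(full_area, reduced_area)
```
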

Hybridization with all ONPs of the same length would be carried out at the same temperature under conditions
that eliminate the effect of the C + G composition [Wood, W., et al., Proc. Natl. Acad. Sci. 82, 1585 (1985)].
For 11-mers, the hybridization and the washing would be carried out at 20 °C. For biotinylated probes, which
require about 2 ng of probe per cm^2 of filter, an amount of one to three optical units of each ONP (50 µg) would
be sufficient for the sequencing of a mammalian genome provided the hybridization liquid is used only once. By
simultaneous synthesis of 10 optical units and possibly by simultaneous hybridization, the sequencing of individual
genomes could be simplified, accelerated and made less expensive.

The cost of sequencing per genome the size of a mammalian genome would not exceed 100 million dollars.

This is 5 times less expensive than the costs estimated within the framework of the Japanese project. We also
believe that the total time needed for the sequencing of a genome including ONP synthesis is shorter and amounts
to about 1-2 years.

Because as many genomic fragments are taken for sequencing as are necessary for each fragment to overlap
the neighboring one at least slightly, from the sequenced fragments one obtains, by overlapping over homologous
sequences at the ends of the fragments, an arranged library of fragments (clones) and the sequences of each
chromosome. This is not so in the sequencing of amplified fragments, because it is possible to amplify and to
sequence only fragments that do not contain, or do not belong to, repetitive sequences. In this case, by arranging
the sequenced fragments, one would obtain only regions between repetitive neighboring sequences.
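
The arrangement of sequenced fragments over their shared terminal sequences, as described in the paragraph
above, can be sketched as a greedy merge. This is a minimal illustration and not the procedure of the text:
the function names, the minimum overlap of 4 bases and the toy sequences are assumptions, and the sketch ignores
reverse complements, sequencing errors and repetitive sequences.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no such overlap of at least min_len bases exists.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def arrange(fragments, min_len=4):
    """Greedily join fragments that share terminal sequences.

    Returns the list of contigs left when no further pair overlaps.
    """
    contigs = list(fragments)
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        joined = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(joined)


# Overlapping clones collapse into a single ordered sequence.
print(arrange(["ATGGCGTACG", "GTACGTTAGC", "TTAGCCTAGGA"]))
# prints ['ATGGCGTACGTTAGCCTAGGA']

# Fragments without shared ends stay separate, as with amplified fragments
# that are interrupted by unsequenced repetitive regions.
print(arrange(["AAAACCCC", "GGGGTTTT"]))
```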

It appears that the optimum procedure for sequencing genomes by the method of hybridization with ONPs
is by colony hybridization of clones larger than 1000 bp with about 300,000 to 500,000 ONPs with a length of
10-11 nucleotides in about 50,000 separate hybridizations with combinations of about 30-50 ONPs so distributed
that each ONP is repeated in three combinations wherein the other 90 ONPs are represented only in one of the three
given combinations. The possibility of detection of a very small amount of DNA by use of biotinylated probes, by
increasing the number of rehybridizations of a filter and by reducing the number of hybridizations per fragment in
elimination hybridizations with combinations of about 100-500 ONPs, however, makes it possible to reduce the
required amount of DNA to less than 1 µg. This quantity of plasmids can be isolated from bacterial cultures grown
in one hole of a microtitration plate. We can also visualize simple, crude isolation of plasmid DNA, which could
even be easier than growing colonies in thousands of replications. The entire isolation procedure would be carried
out on microtitration plates. Centrifugation in the microtitration plates would remove the medium, and alkaline lysis
and denaturation of the protein with acidic sodium acetate would give the cell membrane chromosomal precipitate,
which would be removed by centrifugation. The supernatant would be denatured with sodium hydroxide and
transferred to the filter in the form of a sufficient number of dots. It would be easy to introduce the steps of
alcoholic precipitation and treatment of the preparations with the RNase enzyme if it were necessary to reduce the
background noise from the hybridization with bacterial RNA. This method of isolation of plasmid DNA makes dot
blot hybridization more advantageous than colony hybridization.
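
The combination scheme mentioned at the start of the paragraph above, in which each ONP is repeated in three
combinations and meets any other ONP in at most one of them, can be illustrated on a toy scale. The sketch below
is an assumption-laden miniature: it places 9 hypothetical probes on a 3 x 3 grid and pools them by row, column
and diagonal, which is one simple way to obtain that property and is not claimed to be the distribution intended
in the text. A probe is kept only if every combination containing it gave a signal. All names are illustrative.

```python
from itertools import product


def build_pools(n=3):
    """Assign each probe (r, c) to three pools: its row, column and diagonal.

    With this layout two different probes share at most one pool.
    """
    pools = {}
    for r, c in product(range(n), repeat=2):
        for pool_id in (("row", r), ("col", c), ("diag", (r + c) % n)):
            pools.setdefault(pool_id, set()).add((r, c))
    return pools


def hybridize(pools, probes_in_fragment):
    """Simulate the experiment: a pool is positive if any of its probes
    is complementary to the fragment."""
    return {pid for pid, members in pools.items() if members & probes_in_fragment}


def surviving_probes(pools, positive_pools):
    """Eliminate every probe that occurs in at least one negative pool."""
    membership = {}
    for pid, members in pools.items():
        for probe in members:
            membership.setdefault(probe, set()).add(pid)
    return {probe for probe, pids in membership.items() if pids <= positive_pools}


pools = build_pools()
signal = hybridize(pools, {(0, 0), (1, 1)})
print(sorted(surviving_probes(pools, signal)))
# prints [(0, 0), (1, 1)]
```

Here 9 pooled hybridizations replace 9 single-probe ones, so nothing is saved; the reduction appears only when the
pools are much larger than the number of pools per probe. When many probes in a pool are truly positive, this
simple rule can leave false candidates, which would then need individual hybridizations to resolve.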

1. The procedure of genome sequencing by hybridization with oligonucleotide probes, characterized in
that genomic DNA fragments containing 100 to 20,000 bp, obtained in sufficient amount by cloning into vectors
that replicate in E. coli or by amplification of genomic DNA with mixtures of oligonucleotide primers, or by the
procedure of colony hybridization without or with selective prehybridization of nonbiotinylated bacterial DNA, or
by dot blot hybridization of isolated chimeric DNA vector inserts, or by dot blot hybridization of DNA from
amplification reactions, under hybridization conditions permitting only the hybridization of sequences with complete
homology, are hybridized to 100,000 to 1,000,000 biotinylated oligonucleotide probes of different sequence length
ranging from 10 to 13 nucleotides, each probe being hybridized separately or in combinations of 10 to 500 probes,
and that the groups of oligonucleotide sequences that are located in individual genomic DNA fragments are
determined by detection of the bound biotin, the arrangement of said fragments over overlapping sequences then
giving the order of the nucleotides in said fragments.

2. The procedure according to Claim 1, characterized in that one deduces from the combinations of
oligonucleotide probes that hybridize to one fragment the oligonucleotide probes that produce hybridization by
elimination of those probes whose other combinations in which they are present do not hybridize to the given
fragment, or that in all combinations containing the given probe and which ... present at least one additional
probe, all combinations of which that contain it hybridizing to the given fragment.

3. The procedure according to Claims 1 and 2, characterized in that the oligonucleotide probes that
hybridize to the given genomic DNA fragment arrange themselves into one or several arrays in the alphabetical
order of the lettered part and according to increasing value of the numbered part of their marking, which they
acquired on the basis of possible overlap with the utilized probes, and that the arrays of markings are deciphered
by means of a reverse algorithm to give the order of nucleotides.



4. The procedure according to Claims 1 through 3, characterized in that by detecting identical
overlapping terminal sequences between sequenced fragments, sequenced clones or amplified fragments are
arranged in an array, and the overall sequence of each chromosome of the given genome is determined.

ABSTRACT

The existence of given oligonucleotide sequences in genomic DNA fragments is determined by colony
hybridization or dot blot hybridization of genomic DNA fragments with 3000 to 15,000 bp, obtained by cloning or
amplification, with a set of oligonucleotide probes containing 300,000 to 3 million biotinylated probes of 10 to
12 nucleotides, individually or in combinations of 10 to 500 ONPs, under conditions permitting hybridization only
with completely homologous sequences. By arranging the detected sequences over identical regions, the sequence of
each individual DNA fragment is determined. By analyzing the number of fragments that cover the genome three times,
each fragment can be made to overlap on both sides with at least one fragment. By detecting the fragments with
identical terminal sequences, chromosome libraries and chromosomal sequences are obtained. This procedure is more
than 5 times less expensive and faster than standard automated sequencing procedures.



INDUSTRIAL USE OF THE DEVELOPED GENOME SEQUENCING PROCEDURE

Based on the described sequencing procedure, it is possible to build a plant for sequencing genomic DNA.
In our opinion, a large scientific-technological market will exist for genomic and sequenced genomic
fragments of at least 50 characteristic species.

In addition to its economic justification, such a plant would enable a large number of scientists to study the
functions of genomic DNA fragments of known sequence and, by procedures of genetic engineering, to create new,
useful combinations of genetic materials instead of cloning and sequencing certain genes at lower efficacy.